GH-44042: [C++][Parquet] Limit num-of row-groups when building parquet for encrypted file #44043
Conversation
Is erroring out the right solution? @ggershinsky What is the intent here?
Ok, the C++ reader always uses the physical row group number as the row group ordinal for AAD computation (see arrow/cpp/src/parquet/file_reader.cc, lines 270 to 278 at b4b22a4).
This begs the question: why is the row group ordinal stored in Thrift metadata if it's simply ignored when reading?
And by the way, the same audit should be done for column and page ordinals.
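For context, the module AAD that the physical row group number feeds into is, per the Parquet encryption spec (Encryption.md), a concatenation of the file AAD, a 1-byte module type, and 2-byte little-endian row group, column, and page ordinals. A minimal sketch with illustrative names (not the actual Arrow C++ API):

```cpp
#include <cstdint>
#include <string>

// Sketch of module-AAD construction as described by the Parquet encryption
// spec: file AAD || module type (1 byte) || row group ordinal (2 bytes, LE)
// || column ordinal (2 bytes, LE) || page ordinal (2 bytes, LE).
// Function and parameter names are hypothetical.
std::string CreateModuleAad(const std::string& file_aad, int8_t module_type,
                            int16_t row_group_ordinal, int16_t column_ordinal,
                            int16_t page_ordinal) {
  auto append_le16 = [](std::string& out, int16_t v) {
    out.push_back(static_cast<char>(v & 0xFF));
    out.push_back(static_cast<char>((v >> 8) & 0xFF));
  };
  std::string aad = file_aad;
  aad.push_back(static_cast<char>(module_type));
  append_le16(aad, row_group_ordinal);
  append_le16(aad, column_ordinal);
  append_le16(aad, page_ordinal);
  return aad;
}
```

Because the ordinals are baked into the AAD, a reader that computes them from the physical position will fail authentication on any module whose position was changed.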
Isn't it a bug on the C++ side? It may produce a wrong AAD when reading a Parquet file created by ParquetRewriter, which binpacks several encrypted files. EDIT: My bad. ParquetRewriter does not support binpacking encrypted files yet.
Hopefully it's not possible to concatenate encrypted files together? Otherwise one could create malicious data (replay attack).
The number of row groups (columns, pages) should be limited for encrypted files only. I don't think this exception should be thrown for unencrypted files.
Yep, merging encrypted files is impossible, the AAD has a unique file id component.
It seems to be optional and disabled by default according to https://github.com/apache/parquet-format/blob/master/Encryption.md#441-aad-prefix ? Regardless, my question was about ordinal reuse or shuffling. Should readers verify that ordinals correspond to the physical row group numbers? What is the intent exactly? The spec does not say how these should be handled.
It's a different parameter,
The rg ordinals in thrift are a utility, which can be helpful for readers that split a single file into parallel reading threads. The readers can also run an rg loop/counter. The values will be the same, unless the file is tampered with. Note: the thrift footer is tamper-proof, so the rg ordinals are safe; but the row groups / pages can be shuffled by an attacker. Reading a page from a shuffled rg will throw an exception.
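The counter-based cross-check described above could be sketched as follows (a hypothetical helper, not the actual Arrow C++ API): compare each ordinal stored in the tamper-proof Thrift footer against the physical row group index.

```cpp
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical sketch: verify that the ordinals recorded in the (signed,
// tamper-proof) footer match the physical row group positions. A mismatch
// suggests row groups were shuffled; decrypting a page from a shuffled row
// group would independently fail its AAD check.
void VerifyRowGroupOrdinals(const std::vector<int16_t>& stored_ordinals) {
  for (size_t i = 0; i < stored_ordinals.size(); ++i) {
    if (stored_ordinals[i] != static_cast<int16_t>(i)) {
      throw std::runtime_error("row group ordinal mismatch: possible tampering");
    }
  }
}
```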
Since we're talking about a security feature, it would be much better if the spec gave clear guidelines instead of letting implementers do the necessary guesswork (and potentially make errors).
I do agree the spec often feels laconic, not only here but in other places too. Expanding it, with ample explanation and reasoning, would result in a different document (an implementation guide), much longer than the spec.
@ggershinsky Would you mind adding a patch to parquet-format? Then I can move forward here.
Sure, here it goes: apache/parquet-format#453
apache/parquet-format#453 is merged and I understand this feature is for encryption. However, this value is still written to the metadata. Should I:
Perhaps a quick check that other Parquet readers only use the row group ordinal for encrypted files?
@mapleFU @pitrou I'm trying to pick up this task to ramp up on the parquet codebase. For the Java implementation, there is a check for whether the file is encrypted or not before the row group ordinal is written. For the C++ implementation, we don't really have a check for encrypted files like the Java implementation has when we write the row group ordinal (not that I've found, or maybe I'm missing something), so we always write this ordinal value even for unencrypted files. In my opinion, we should change the logic here to behave like the Java implementation, with an additional check for integer overflow. What do you both think?
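The proposed behavior (write the ordinal only for encrypted files, and guard against overflowing the int16 field) could be sketched as below. This is illustrative pseudologic with hypothetical names, not the actual FileSerializer code:

```cpp
#include <cstdint>
#include <limits>
#include <optional>
#include <stdexcept>

// Sketch of the proposed writer-side rule, mirroring parquet-java:
// - unencrypted files: do not emit a row-group ordinal at all;
// - encrypted files: reject row-group counts that would overflow the
//   int16 ordinal used in the AAD (max 32767).
std::optional<int16_t> NextRowGroupOrdinal(int num_row_groups, bool encrypted) {
  if (!encrypted) {
    return std::nullopt;  // unencrypted files need no ordinal
  }
  if (num_row_groups > std::numeric_limits<int16_t>::max()) {
    throw std::runtime_error("too many row groups for an encrypted file");
  }
  return static_cast<int16_t>(num_row_groups);
}
```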
The idea LGTM |
cpp/src/parquet/file_reader.cc

```diff
@@ -267,12 +267,13 @@ class SerializedRowGroup : public RowGroupReader::Contents {
   ARROW_DCHECK_NE(data_decryptor, nullptr);

   constexpr auto kEncryptedRowGroupsLimit = 32767;
-  if (i > kEncryptedRowGroupsLimit) {
+  if (ARROW_PREDICT_FALSE(row_group_ordinal_ > kEncryptedRowGroupsLimit)) {
```
The previous check actually checks the column_id, not the row-group ordinal.
cpp/src/parquet/file_writer.cc

```cpp
@@ -359,14 +359,26 @@ class FileSerializer : public ParquetFileWriter::Contents {
  if (row_group_writer_) {
    row_group_writer_->Close();
  }
  int16_t row_group_ordinal = 0;
  if (num_row_groups_ < std::numeric_limits<int16_t>::max()) {
    row_group_ordinal = static_cast<int16_t>(num_row_groups_);
```
This check is a bit ugly. Maybe we could also not write the row_group_ordinal for unencrypted files.
Yes, I think there is no need to write it.
Co-authored-by: Antoine Pitrou <[email protected]>
```diff
  if (ccmd.__isset.ENCRYPTION_WITH_COLUMN_KEY) {
    if (file_decryptor != nullptr && file_decryptor->properties() != nullptr) {
      // should decrypt metadata
      std::shared_ptr<schema::ColumnPath> path = std::make_shared<schema::ColumnPath>(
          ccmd.ENCRYPTION_WITH_COLUMN_KEY.path_in_schema);
-     std::string key_metadata = ccmd.ENCRYPTION_WITH_COLUMN_KEY.key_metadata;
+     const std::string& key_metadata = ccmd.ENCRYPTION_WITH_COLUMN_KEY.key_metadata;
```
Also, I wonder: should we check row_group_ordinal >= 0 here? 🤔
I don't think it's necessary; any invalid row group ordinal will fail the AAD check when reading.
```diff
-  constexpr auto kEncryptedRowGroupsLimit = 32767;
-  if (i > kEncryptedRowGroupsLimit) {
+  constexpr auto kEncryptedOrdinalLimit = 32767;
+  if (ARROW_PREDICT_FALSE(row_group_ordinal_ > kEncryptedOrdinalLimit)) {
```
I am wondering if this is the best place to enforce this. Is it possible to check these while creating the FileMetaData, if encrypted?
It seems that if we don't request reading data from row_group_ordinal > 32767 and column_ordinal > 32767, we will not even notice that it is a malformed encrypted file.
Do you mean when constructing ColumnChunkMetaData::ColumnChunkMetaDataImpl? What if ENCRYPTION_WITH_COLUMN_KEY is not set but this is being encrypted?
After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit f4a63d4. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 339 possible false positives for unstable benchmarks that are known to sometimes produce them.
Rationale for this change
Limit the number of row groups when building a Parquet file for an encrypted file.
What changes are included in this PR?
Limit the number of row groups when building a Parquet file for an encrypted file.
Are these changes tested?
No
Are there any user-facing changes?
No